Abstract

We present a stochastic finite-state model for segment- 
ing Chinese text into dictionary entries and produc- 
tively derived words, and providing pronunciations for 
these words; the method incorporates a class-based 
model in its treatment of personal names. We also 
evaluate the system's performance, taking into account 
the fact that people often do not agree on a single seg- 
mentation. 

THE PROBLEM 

The initial step of any text analysis task is the tok- 
enization of the input into words. For many writing 
systems, using whitespace as a delimiter for words 
yields reasonable results. However, for Chinese and 
other systems where whitespace is not used to delimit 
words, such trivial schemes will not work. Chinese 
writing is morphosyllabic (DeFrancis, 1984), meaning
that each hanzi - 'Chinese character' - (nearly always) 
represents a single syllable that is (usually) also a sin- 
gle morpheme. Since in Chinese, as in English, words 
may be polysyllabic, and since hanzi are written with 
no intervening spaces, it is not trivial to reconstruct 
which hanzi to group into words. 

While for some applications it may be possible 
to bypass the word-segmentation problem and work 
straight from hanzi, there are several reasons why this 
approach will not work in a text-to-speech (TTS) sys- 
tem for Mandarin Chinese — the primary intended 
application of our segmenter. These reasons include: 

1. Many hanzi are homographs whose pronunciation
depends upon word affiliation. So, 的 is pronounced
de0¹ when it is a prenominal modification marker,
but di4 in the word 目的 mu4di4 'goal'; 乾 is normally
gan1 'dry', but qian2 in a person's given name.

2. Some phonological rules depend upon correct word
segmentation, including Third Tone Sandhi (Shih,
1986), which changes a 3 tone into a 2 tone before
another 3 tone: 小老鼠 xiao3 [lao3 shu3] 'little rat'
becomes xiao3 [lao2-shu3], rather than
xiao2 [lao2-shu3], because the rule first applies
within the word lao3-shu3, blocking its phrasal
application.

1 We use pinyin transliteration with numbers representing
tones.
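The interaction between segmentation and Third Tone Sandhi can be illustrated with a minimal sketch. It assumes a deliberately simplified version of the rule (a 3 tone becomes a 2 tone when the following syllable carries a 3 tone) and a hypothetical input format of pinyin strings ending in a tone digit; the function names are illustrative and are not part of the system described here.

```python
def split_tone(syllable):
    """Split a pinyin syllable such as 'lao3' into ('lao', 3)."""
    return syllable[:-1], int(syllable[-1])


def sandhi_flat(syllables):
    """Apply 3 -> 2 before another 3, ignoring word boundaries.

    Decisions are made on the underlying (input) tones.
    """
    out = []
    for i, syl in enumerate(syllables):
        base, tone = split_tone(syl)
        has_next = i + 1 < len(syllables)
        if tone == 3 and has_next and split_tone(syllables[i + 1])[1] == 3:
            tone = 2
        out.append(f"{base}{tone}")
    return out


def sandhi_word_first(words):
    """Apply the rule inside each word first, then across word boundaries,
    using the tones that result from the word-internal pass."""
    words = [sandhi_flat(w) for w in words]
    for i in range(len(words) - 1):
        last_base, last_tone = split_tone(words[i][-1])
        _, next_tone = split_tone(words[i + 1][0])
        if last_tone == 3 and next_tone == 3:
            words[i][-1] = f"{last_base}2"
    return words
```

With the segmentation xiao3 [lao3 shu3], `sandhi_word_first([["xiao3"], ["lao3", "shu3"]])` leaves xiao3 unchanged because lao3 has already become lao2 inside its word, whereas `sandhi_flat(["xiao3", "lao3", "shu3"])` changes both xiao3 and lao3.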

While a minimal requirement for building a Chi- 
nese word-segmenter is a dictionary, a dictionary is in- 
sufficient since there are several classes of words that 
are not generally found in dictionaries. Among these: 

1. Morphologically Derived Words: 小將們 xiao3-jiang4-men0
(little general-plural) 'little generals'.

2. Personal Names: 周恩來 zhou1 en1-lai2 'Zhou Enlai'.

3. Transliterated Foreign Names: 布朗士維克
bu4-lang3-shi4-wei2-ke4 'Brunswick'.

We present a stochastic finite-state model for seg- 
menting Chinese text into dictionary entries and words 
derived via the above-mentioned productive processes; 
as part of the treatment of personal names, we dis- 
cuss a class-based model which uses the Good-Turing 
method to estimate costs of previously unseen personal 
names. The segmenter handles the grouping of hanzi 
into words and outputs word pronunciations, with de- 
fault pronunciations for hanzi it cannot group; we focus 
here primarily on the system's ability to segment text 
appropriately (rather than on its pronunciation abili- 
ties). We evaluate various specific aspects of the seg- 
mentation, and provide an evaluation of the overall 
segmentation performance: this latter evaluation com- 
pares the performance of the system with that of several 
human judges, since even people do not agree on a sin- 
gle correct way to segment a text. 

PREVIOUS WORK 

There is a sizable literature on Chinese word segmenta- 
tion: recent reviews include (Wang et al., 1990; Wu and 
Tseng, 1993). Roughly, previous work can be classi- 
fied into purely statistical approaches (Sproat and Shih, 
1990), statistical approaches which incorporate lexical 
knowledge (Fan and Tsai, 1988; Lin et al., 1993), and 
approaches that include lexical knowledge combined 
with heuristics (Chen and Liu, 1992). 



Chen and Liu's (1992) algorithm matches words of 
an input sentence against a dictionary; in cases where 
various parses are possible, a set of heuristics is applied 
to disambiguate the analyses. Various morphological 
rules are then applied to allow for morphologically 
complex words that are not in the dictionary. Preci- 
sion and recall rates of over 99% are reported, but note 
that this covers only words that are in the dictionary: 
"the . . . statistics do not count the mistakes [that occur] 
due to the existence of derived words or proper names" 
(Chen and Liu, 1992, page 105). Lin et al. (1993) de- 
scribe a sophisticated model that includes a dictionary 
and a morphological analyzer. They also present a gen- 
eral statistical model for detecting 'unknown words' 
based on hanzi and part-of-speech sequences. How- 
ever, their unknown word model has the disadvantage 
that it does not identify a sequence of hanzi as an un- 
known word of a particular category, but merely as an 
unknown word (of indeterminate category). For an ap- 
plication like TTS, however, it is necessary to know that 
a particular sequence of hanzi is of a particular category 
because, for example, that knowledge could affect the 
pronunciation. We therefore prefer to build particular 
models for different classes of unknown words, rather 
than building a single general model. 

DICTIONARY REPRESENTATION 

The lexicon of basic words and stems is represented as a
weighted finite-state transducer (WFST) (Pereira et al.,
1994). Most transitions represent mappings between
hanzi and pronunciations, and are costless. Transitions
between orthographic words and their parts-of-speech
are represented by ε-to-category transductions and a
unigram cost (negative log probability) of that word
estimated from a 20M hanzi training corpus; a portion
of the WFST is given in Figure 1.² Besides dictionary
words, the lexicon contains all hanzi in the Big 5
Chinese code, with their pronunciation(s), plus entries for
other characters (e.g., roman letters, numerals, special
symbols).

Given this dictionary representation, recognizing a
single Chinese word involves representing the input as
a finite-state acceptor (FSA) where each arc is labeled
with a single hanzi of the input. The left-restriction
of the dictionary WFST with the input FSA contains
all and only the (single) lexical entries corresponding
to the input. This WFST includes the word costs
on arcs transducing ε to category labels. Now, input
sentences consist of one or more entries from the
dictionary, and we can generalize the word recognition
problem to the word segmentation problem, by
left-restricting the transitive closure of the dictionary with
the input. The result of this left-restriction is a WFST
that gives all and only the possible analyses of the
input FSA into dictionary entries. In general we do not
want all possible analyses but rather the best analysis.
This is obtained by computing the least-cost path in the
output WFST. The final stage of segmentation involves
traversing the best path, collecting into words all
sequences of hanzi delimited by part-of-speech-labeled
arcs. Figure 2 shows an example of segmentation: the
sentence 日文章魚怎麼說 "How do you say octopus
in Japanese?" consists of four words, namely 日文
ri4-wen2 'Japanese', 章魚 zhang1-yu2 'octopus',
怎麼 zen3-mo 'how', and 說 shuo1 'say'. In this case,
日 ri4 is also a word (e.g. a common abbreviation for
Japan), as are 文章 wen2-zhang1 'essay' and 魚 yu2
'fish', so there is (at least) one alternate analysis to be
considered.

2 The costs are actually for strings rather than words: we
currently lack estimates for the words themselves. We assign
the string cost to lexical entries with the likeliest
pronunciation, and a large cost to all other entries. Thus 將/adv, with
the commonest pronunciation jiang1, has cost 5.98, whereas
將/nc, with the rarer pronunciation jiang4, is assigned a high
cost. Note also that the current model is zeroth-order in that
it uses only unigram costs. Higher-order models, e.g. bigram
word models, could easily be incorporated into the present
architecture if desired.
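The least-cost analysis can be sketched without any finite-state machinery. The sketch below replaces the WFST composition and shortest-path computation with an equivalent dynamic program over a plain dictionary of unigram costs; it is an illustration, not the implementation described here. The costs in `TOY_COSTS`, the function name `segment`, and the `max_len` parameter are all assumptions made for the example.

```python
import math

# Hypothetical unigram costs (negative log probabilities).
# These values are invented for illustration and are not taken from the paper.
TOY_COSTS = {
    "日": 9.0, "日文": 10.0, "文章": 10.5, "章魚": 11.0,
    "魚": 9.5, "怎麼": 6.0, "說": 5.0,
}


def segment(text, costs, max_len=4):
    """Return (total_cost, words) for the least-cost split of `text` into
    entries of `costs`, or (inf, []) if no complete analysis exists."""
    n = len(text)
    best = [math.inf] * (n + 1)  # best[i]: cheapest cost of analysing text[:i]
    back = [0] * (n + 1)         # back[i]: start of the last word in that analysis
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in costs and best[start] + costs[word] < best[end]:
                best[end] = best[start] + costs[word]
                back[end] = start
    if math.isinf(best[n]):
        return math.inf, []
    words, i = [], n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return best[n], words[::-1]
```

Under these toy costs, `segment("日文章魚怎麼說", TOY_COSTS)` prefers the four-word analysis over the alternative that begins with 日 followed by 文章, because the summed cost of the former is lower.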

MORPHOLOGICAL ANALYSIS 

The method just described segments dictionary words,
but as noted there are several classes of words that
should be handled that are not in the dictionary. One
class comprises words derived by productive
morphological processes, such as plural noun formation
using the suffix 們 men0. The morphological
analysis itself can be handled using well-known
techniques from finite-state morphology (Koskenniemi,
1983; Antworth, 1990; Tzoukermann and Liberman,
1990; Karttunen et al., 1992; Sproat, 1992); so, we
represent the fact that 們 attaches to nouns by allowing
ε-transitions from the final states of all noun entries
to the initial state of the sub-WFST representing 們.
However, for our purposes it is not sufficient to
represent the morphological decomposition of, say,
plural nouns: we also need an estimate of the cost of
the resulting word. For derived words that occur in
our corpus we can estimate these costs as we would
the costs for an underived dictionary entry. So, 將們
jiang4-men0 '(military) generals' occurs and we
estimate its cost at 15.02. But we also need an estimate
of the probability for a non-occurring though possible
plural form like 南瓜們 nan2-gua1-men0 'pumpkins'.
Here we use the Good-Turing estimate (Baayen,
1989; Church and Gale, 1991), whereby the aggregate
probability of previously unseen members of a
construction is estimated as $N_1/N$, where $N$ is the
total number of observed tokens and $N_1$ is the number
of types observed only once. For 們 this gives
prob(unseen(們) | 們), and to get the aggregate
probability of novel 們-constructions in a corpus we
multiply this by prob_text(們) to get prob_text(unseen(們)).
Finally, to estimate the probability of particular unseen
word 南瓜們, we use the simple bigram backoff model
prob(南瓜們) = prob(南瓜) prob_text(unseen(們));
Figure 1: Partial Chinese Lexicon (NC = noun; NP = proper noun)

Figure 2: Input lattice (top) and two segmentations (bottom) of the sentence 'How do you say octopus in Japanese'. A
non-optimal analysis is shown with dotted lines in the bottom frame.

Figure 3: An example of affixation: the plural affix



cost(南瓜們) is computed in the obvious way.
Figure 3 shows how this model is implemented as part of
the dictionary WFST. There is a (costless) transition
between the NC node and 們. The transition from
們 to a final state transduces ε to the grammatical tag
\PL with cost cost_text(unseen(們)):
cost(南瓜們) = cost(南瓜) + cost_text(unseen(們)), as desired.
For the seen word 將們 'generals', there is an ε:nc
transduction from 將 to the node preceding 們; this
arc has cost cost(將們) − cost_text(unseen(們)), so
that the cost of the whole path is the desired cost(將們).
This representation gives 將們 an appropriate
morphological decomposition, preserving information that
would be lost by simply listing 將們 as an unanalyzed
form. Note that the backoff model assumes that there is
a positive correlation between the frequency of a
singular noun and its plural. An analysis of nouns that occur
both in the singular and the plural in our database
reveals that there is indeed a slight but significant positive
correlation (R^2 = 0.20, p < 0.005). This suggests
that the backoff model is as reasonable a model as we
can use in the absence of further information about the
expected cost of a plural form.
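The cost arithmetic for seen and unseen plurals can be written out as a short sketch. It assumes costs are natural-log negative probabilities (the log base is not stated in the text), and every function name and count below is hypothetical; only the value 15.02 for 將們 comes from the text.

```python
import math


def cost(prob):
    """Cost as a negative log probability (natural log is an assumption)."""
    return -math.log(prob)


def unseen_affix_cost(n_once, n_tokens, affix_prob_in_text):
    """cost_text(unseen(affix)), from prob_text(unseen) = (N_1 / N) * prob_text(affix)."""
    return cost((n_once / n_tokens) * affix_prob_in_text)


def unseen_word_cost(stem_cost, unseen_cost):
    """Backoff cost of an unseen derived word: cost(stem) + cost_text(unseen(affix))."""
    return stem_cost + unseen_cost


def seen_word_arc_cost(word_cost, unseen_cost):
    """Cost placed on the arc into the affix for a seen derived word, chosen so
    that the whole path (arc plus affix exit) sums to the observed word cost."""
    return word_cost - unseen_cost
```

For example, with an invented `u = unseen_affix_cost(50, 1000, 0.001)`, the arc for 將們 would carry `seen_word_arc_cost(15.02, u)`, and adding `u` back on the final transition recovers 15.02 for the full path.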

CHINESE PERSONAL NAMES 

Full Chinese personal names are in one respect
simple: they are always of the form FAMILY+GIVEN.
The FAMILY name set is restricted: there are a few
hundred single-hanzi FAMILY names, and about ten
double-hanzi ones. Given names are most commonly
two hanzi long, occasionally one hanzi long: there
are thus four possible name types. The difficulty is that
GIVEN names can consist, in principle, of any hanzi or
pair of hanzi, so the possible GIVEN names are limited
only by the total number of hanzi, though some hanzi
are certainly far more likely than others. For a sequence
of hanzi that is a possible name, we wish to assign a
probability to that sequence qua name. We use an
estimate derived from (Chang et al., 1992). For example,
given a potential name of the form F1 G1 G2, where F1
is a legal FAMILY name and G1 and G2 are each hanzi,
we estimate the probability of that name as the
product of the probability of finding any name in text; the
probability of F1 as a FAMILY name; the probability
of the first hanzi of a double GIVEN name being G1;
the probability of the second hanzi of a double GIVEN
name being G2; and the probability of a name of the
form SINGLE-FAMILY+DOUBLE-GIVEN. The first
probability is estimated from a name count in a text
database, whereas the last four probabilities are
estimated from a large list of personal names.³ This model
is easily incorporated into the segmenter by building a
WFST restricting the names to the four licit types, with
costs on the arcs for any particular name summing to
an estimate of the cost of that name. This WFST is then
summed with the WFST implementing the dictionary
and morphological rules, and the transitive closure of
the resulting transducer is computed.
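As a rough illustration of the name model, the sketch below enumerates the licit FAMILY+GIVEN splits of a candidate string and sums the costs of the five probabilities for an F1 G1 G2 name. The function names, the family-name sets, and all probability values are hypothetical, and natural-log costs are an assumption.

```python
import math


def name_parses(hanzi, single_family, double_family):
    """Return the (family, given) splits that match one of the four licit
    types: a one- or two-hanzi FAMILY followed by a one- or two-hanzi GIVEN."""
    parses = []
    for fam_len, families in ((1, single_family), (2, double_family)):
        given_len = len(hanzi) - fam_len
        if hanzi[:fam_len] in families and given_len in (1, 2):
            parses.append((hanzi[:fam_len], hanzi[fam_len:]))
    return parses


def name_cost(p_name, p_family, p_given1, p_given2, p_type):
    """Cost of an F1 G1 G2 name: the negative log of the product of the five
    probabilities, which equals the sum of their individual costs."""
    probs = (p_name, p_family, p_given1, p_given2, p_type)
    return sum(-math.log(p) for p in probs)
```

Because costs are additive, each of the five terms can sit on its own arc of the name WFST, so that any path through a licit name type accumulates the total cost of that name.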



3 We have two such lists, one containing about 17,000 full 
names, and another containing frequencies of hanzi in the 
various name positions, derived from a million names. 



There are two weaknesses in Chang et al.'s (1992) 
model, which we improve upon. First, the model as- 
sumes independence between the first and second hanzi 
of a double GIVEN name. Yet, some hanzi are far more 
probable in women's names than they are in men's 
names, and there is a similar list of male-oriented hanzi: 
mixing hanzi from these two lists is generally less likely 
than would be predicted by the independence model. 
As a partial solution, for pairs of hanzi that cooccur suf- 
ficiently often in our namelists, we use the estimated 
bigram cost, rather than the independence-based cost. 
The second weakness is that Chang et al. (1992) as- 
sign a uniform small cost to unseen hanzi in GIVEN 
names; but we know that some unseen hanzi are merely 
accidentally missing, whereas others are missing for a 
reason — e.g., because they have a bad connotation. 
We can address this problem by first observing that
for many hanzi, the general 'meaning' is indicated by
its so-called 'semantic radical'. Hanzi that share the
same 'radical' share an easily identifiable structural
component: hanzi for plant names share the
GRASS radical; hanzi for maladies share
the SICKNESS radical; and hanzi for ratlike animals
share the RAT radical. Some classes are better
for names than others: in our corpora, many names
are picked from the GRASS class, very few from the
SICKNESS class, and none from the RAT class. We
can thus better predict the probability of an unseen
hanzi occurring in a name by computing a within-class
Good-Turing estimate for each radical class.
Assuming unseen objects within each class are equiprobable,
their probabilities are given by the Good-Turing
theorem as:

$$p_0^{cls} \propto \frac{E(N_1^{cls})}{N \cdot E(N_0^{cls})} \qquad (1)$$

where $p_0^{cls}$ is the probability of one unseen hanzi in
class $cls$, $E(N_1^{cls})$ is the expected number of hanzi
in $cls$ seen once, $N$ is the total number of hanzi, and
$E(N_0^{cls})$ is the expected number of unseen hanzi in
class $cls$. The use of the Good-Turing equation
presumes suitable estimates of the unknown expectations
it requires. In the denominator, the $N_0^{cls}$ are well
measured by counting, and we replace the expectation by
the observation. In the numerator, however, the counts
of $N_1^{cls}$ are quite irregular, including several zeros (e.g.
RAT, none of whose members were seen). However,
there is a strong relationship between $N_1^{cls}$ and the
number of hanzi in the class. For $E(N_1^{cls})$, then, we
substitute a smooth, written $S(N_1^{cls})$ here, against the
number of class elements. This smooth guarantees that
there are no zeroes estimated. The final estimating
equation is then:

$$p_0^{cls} \propto \frac{S(N_1^{cls})}{N \cdot N_0^{cls}} \qquad (2)$$

The total of all these class estimates was about 10% off
from the Turing estimate $N_1/N$ for the probability of
all unseen hanzi, and we renormalized the estimates so
that they would sum to $N_1/N$.
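The class-based estimate and its renormalization can be sketched as follows. The sketch takes the smoothed once-seen counts as given (the smoothing procedure itself is not specified in the text), and the function name, argument names, and all counts in the usage note are hypothetical.

```python
def class_unseen_probs(smoothed_n1, n0, n1_total, n_total):
    """Per-hanzi probability of an unseen hanzi in each radical class.

    smoothed_n1[cls]: smoothed count of hanzi in cls seen exactly once
    n0[cls]:          number of unseen hanzi in cls
    n1_total:         number of hanzi seen exactly once overall (N_1)
    n_total:          total number of hanzi tokens (N)

    The raw estimate for each class is smoothed_n1 / (N * n0); the estimates
    are then rescaled so that the probability mass summed over every unseen
    hanzi equals the overall Good-Turing estimate N_1 / N.
    """
    raw = {c: smoothed_n1[c] / (n_total * n0[c]) for c in n0}
    raw_mass = sum(raw[c] * n0[c] for c in n0)
    scale = (n1_total / n_total) / raw_mass
    return {c: raw[c] * scale for c in n0}
```

With invented inputs such as `smoothed_n1 = {"GRASS": 8.0, "SICKNESS": 1.5, "RAT": 0.5}` and `n0 = {"GRASS": 40, "SICKNESS": 30, "RAT": 20}`, the smoothing keeps the RAT estimate above zero while still ranking it below GRASS and SICKNESS.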



This class-based model gives reasonable results: 
for six radical classes, Table 1 gives the estimated cost 
for an unseen hanzi in the class occurring as the second 
hanzi in a double GIVEN name. Note that the good 
classes JADE, GOLD and GRASS have lower costs 
than the bad classes SICKNESS, DEATH and RAT, as 
desired. 

TRANSLITERATIONS OF FOREIGN 
WORDS 

Foreign names are usually transliterated using hanzi
whose sequential pronunciation mimics the source
language pronunciation of the name. Since foreign names
can be of any length, and since their original
pronunciation is effectively unlimited, the identification of such
names is tricky. Fortunately, there are only a few
hundred hanzi that are particularly common in
transliterations; indeed, the commonest ones, such as 巴 ba1, 爾
er3, and 阿 a1 are often clear indicators that a sequence
of hanzi containing them is foreign: even a name like
夏米爾 xia4-mi3-er3 'Shamir', which is a legal
Chinese personal name, retains a foreign flavor because
of 爾. As a first step towards modeling transliterated
names, we have collected all hanzi occurring more than
once in the roughly 750 foreign names in our dictionary,
and we estimate the probability of occurrence of each
hanzi in a transliteration ($p_{TN}(hanzi_i)$) using the
maximum likelihood estimate. As with personal names,
we also derive an estimate from text of the
probability of finding a transliterated name of any kind ($p_{TN}$).
Finally, we model the probability of a new
transliterated name as the product of $p_{TN}$ and $p_{TN}(hanzi_i)$
for each $hanzi_i$ in the putative name.⁴ The foreign
name model is implemented as a WFST, which is then
summed with the WFST implementing the dictionary,
morphological rules, and personal names; the transitive
closure of the resulting machine is then computed.
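A minimal sketch of the transliteration model follows. It assumes the maximum likelihood estimate is normalized over the retained hanzi only (the text does not say which denominator is used) and that costs are natural-log negative probabilities; the function names and the toy name list in the usage note are hypothetical.

```python
import math
from collections import Counter


def hanzi_probs(foreign_names):
    """Maximum likelihood estimate of p_TN(hanzi) from a list of names,
    keeping only hanzi that occur more than once in the list."""
    counts = Counter(ch for name in foreign_names for ch in name)
    kept = {ch: c for ch, c in counts.items() if c > 1}
    total = sum(kept.values())
    return {ch: c / total for ch, c in kept.items()}


def transliteration_cost(name, p_tn, probs):
    """Cost of p_TN times the product of p_TN(hanzi_i) over the name;
    infinite if any hanzi of the name is not in the model."""
    if any(ch not in probs for ch in name):
        return math.inf
    return -math.log(p_tn) - sum(math.log(probs[ch]) for ch in name)
```

For instance, `hanzi_probs(["巴爾", "阿爾巴"])` on this invented two-item list keeps 巴 and 爾 (each seen twice) and drops 阿 (seen once), so a candidate containing 阿 receives an infinite cost under this toy model.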

EVALUATION 

In this section we present a partial evaluation of the 
current system in three parts. The first is an evaluation 
of the system's ability to mimic humans at the task of 
segmenting text into word-sized units; the second eval- 
uates the proper name identification; the third measures 
the performance on morphological analysis. To date 
we have not done a separate evaluation of foreign name 
recognition. 

Evaluation of the Segmentation as a Whole: Pre- 
vious reports on Chinese segmentation have invariably 



4 The current model is too simplistic in several respects.
For instance, the common 'suffixes', -nia (e.g., Virginia) and
-sia are normally transliterated as 尼亞 ni2-ya3 and 西亞
xi1-ya3, respectively. The interdependence between 尼, or
西, and 亞 is not captured by our model, but this could easily
be remedied.



Table 1: The cost as a novel GIVEN name (second position) for hanzi from various radical classes. 

JADE GOLD GRASS SICKNESS DEATH RAT 
14.98 15.52 15.76 16.25 16.30 16.42 



cited performance either in terms of a single percent- 
correct score, or else a single precision-recall pair. The 
problem with these styles of evaluation is that, as we 
shall demonstrate, even human judges do not agree 
perfectly on how to segment a given text. Thus, rather 
than give a single evaluative score, we prefer to com- 
pare the performance of our method with the judgments 
of several human subjects. To this end, we picked 100 
sentences at random containing 4372 total hanzi from 
a test corpus. We asked six native speakers — three 
from Taiwan (T1-T3), and three from the Mainland 
(M1-M3) — to segment the corpus. Since we could 
not bias the subjects towards a particular segmentation 
and did not presume linguistic sophistication on their 
part, the instructions were simple: subjects were to 
mark all places they might plausibly pause if they were 
reading the text aloud. An examination of the subjects' 
bracketings confirmed that these instructions were sat- 
isfactory in yielding plausible word-sized units. 

Various segmentation approaches were then com- 
pared with human performance: 

1. A greedy algorithm, GR: proceed through the sen- 
tence, taking the longest match with a dictionary 
entry at each point. 

2. An 'anti-greedy' algorithm, AG: instead of the 
longest match, take the shortest match at each point. 

3. The method being described — henceforth ST. 
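The two baseline algorithms, GR and AG, can be sketched in a few lines. The sketch assumes that a hanzi with no dictionary match is emitted as a one-hanzi word and that dictionary entries are at most `max_len` hanzi long; both choices, and the function name, are illustrative assumptions.

```python
def match_segment(text, dictionary, longest=True, max_len=4):
    """Segment left to right, taking the longest (GR) or the shortest (AG)
    dictionary match at each point. A hanzi with no match at all is emitted
    as a single-hanzi word."""
    words, i = [], 0
    while i < len(text):
        lengths = range(1, min(max_len, len(text) - i) + 1)
        matches = [n for n in lengths if text[i:i + n] in dictionary]
        if matches:
            n = max(matches) if longest else min(matches)
        else:
            n = 1
        words.append(text[i:i + n])
        i += n
    return words
```

On the example sentence 日文章魚怎麼說 with a toy dictionary containing 日, 日文, 文章, 章魚, 魚, 怎麼 and 說, the greedy variant starts with 日文 while the anti-greedy variant starts with 日 and is then forced into 文章 and 魚.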

Two measures that can be used to compare judgments 
are: 

1. Precision. For each pair of judges consider one 
judge as the standard, computing the precision of 
the other's judgments relative to this standard. 

2. Recall. For each pair of judges, consider one judge 
as the standard, computing the recall of the other's 
judgments relative to this standard. 
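The two measures can be made concrete with a short sketch. The text does not state the unit over which precision and recall are computed, so the sketch assumes they are computed over word-boundary positions (including the sentence-final boundary); the function names are hypothetical.

```python
def boundaries(words):
    """Set of character offsets at which a word ends."""
    ends, pos = set(), 0
    for w in words:
        pos += len(w)
        ends.add(pos)
    return ends


def precision_recall(candidate, standard):
    """Precision and recall of the candidate's boundaries, taking the
    other segmentation as the standard."""
    cand, std = boundaries(candidate), boundaries(standard)
    hits = len(cand & std)
    return hits / len(cand), hits / len(std)


def similarity(a, b):
    """Arithmetic mean of precision and recall; swapping a and b swaps
    precision with recall, so the mean is symmetric."""
    p, r = precision_recall(a, b)
    return (p + r) / 2
```

Swapping the roles of the two judges exchanges precision and recall, which is why a single averaged figure per pair of judges suffices.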

Obviously, for judges J1 and J2, taking J1 as
standard and computing the precision and recall for J2
yields the same results as taking J2 as the standard,
and computing for J1, respectively, the recall and
precision. We therefore used the arithmetic mean of each
interjudge precision-recall pair as a single measure of
interjudge similarity. Table 2 shows these similarity
measures. The average agreement among the human
judges is .76, and the average agreement between ST
and the humans is .75, or about 99% of the inter-human
agreement. (GR is .73 or 96%.) One can better
visualize the precision-recall similarity matrix by
producing from that matrix a distance matrix, computing a
multidimensional scaling on that distance matrix, and
plotting the two most significant dimensions. The
result of this is shown in Figure 4. In addition to the
automatic methods, AG, GR and ST, just discussed,
we also added to the plot the values for the current
algorithm using only dictionary entries (i.e., no
productively derived words, or names). This is to allow
for fair comparison between the statistical method and
GR, which is also purely dictionary-based. As can
be seen, GR and this 'pared-down' statistical method
perform quite similarly, though the statistical method is
still slightly better. AG clearly performs much less like
humans than these methods, whereas the full
statistical algorithm, including morphological derivatives and
names, performs most closely to humans among the
automatic methods. It can also be seen clearly in this
plot that two of the Taiwan speakers cluster very closely
together, and the third Taiwan speaker is also close in
the most significant dimension (the x axis). Two of the
Mainlanders also cluster close together but,
interestingly, not particularly close to the Taiwan speakers; the
third Mainlander is much more similar to the Taiwan
speakers.

Personal Name Identification: To evaluate personal
name identification, we randomly selected 186
sentences containing 12,000 hanzi from our test corpus,
and segmented the text automatically, tagging personal
names; note that for names there is always a single
unambiguous answer, unlike the more general question
of which segmentation is correct. The performance
was 80.99% recall and 61.83% precision.
Interestingly, Chang et al. reported 80.67% recall and 91.87%
precision on an 11,000 word corpus: seemingly, our
system finds as many names as their system, but with
four times as many false hits. However, we have reason
to doubt Chang et al.'s performance claims. Without
using the same test corpus, direct comparison is
obviously difficult; fortunately Chang et al. included a
list of about 60 example sentence fragments that
exemplified various categories of performance for their
system. The performance of our system on those
sentences appeared rather better than theirs. Now, on a set
of 11 sentence fragments where they reported 100%
recall and precision for name identification, we had 80%
precision and 73% recall. However, they listed two
sets, one consisting of 28 fragments and the other of 22
fragments in which they had 0% precision and recall.
On the first of these our system had 86% precision and
64% recall; on the second it had 19% precision and
33% recall. Note that it is in precision that our
overall performance would appear to be poorer than that of
Chang et al., yet based on their published examples, our
system appears to be doing better precisionwise. Thus
we have some confidence that our own performance is
at least as good as that of (Chang et al., 1992).⁵

Table 2: Similarity matrix for segmentation judgments

Evaluation of Morphological Analysis: In Table 3
we present results from small test corpora for some
productive affixes; as with names, the segmentation of
morphologically derived words is generally either right
or wrong. The first four affixes are so-called resultative
affixes: they denote some property of the resultant state
of a verb, as in 忘不了 wang4-bu4-liao3
(forget-not-attain) 'cannot forget'. The last affix is the nominal
plural. Note that 了 in 忘不了 is normally pronounced
as le0, but when part of a resultative it is liao3. In the
table are the (typical) classes of words to which the affix
attaches, the number found in the test corpus by the
method, the number correct (with a precision measure),
and the number missed (with a recall measure).

CONCLUSIONS 

In this paper we have shown that good performance 
can be achieved on Chinese word segmentation by us- 
ing probabilistic methods incorporated into a uniform 
stochastic finite-state model. We believe that the ap- 
proach reported here compares favorably with other 
reported approaches, though obviously it is impossible 
to make meaningful comparisons in the absence of uni- 
form test databases for Chinese segmentation. Perhaps 
the single most important difference between our work 
and previous work is the form of the evaluation. As 
we have observed there is often no single right answer 
to word segmentation in Chinese. Therefore, claims to 
the effect that a particular algorithm gets 99% accuracy 
are meaningless without a clear definition of accuracy. 

Figure 4: Classical metric multidimensional scaling of distance matrix, showing the two most significant dimensions.
The percentage scores on the axis labels represent the amount of data explained by the dimension in question.



Table 3: Performance on morphological analysis. 